Non-parametric segmentation of non-stationary time series 
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The non-stationary evolution of observable quantities in complex systems can frequently be de- 
scribed as a juxtaposition of quasi-stationary spells. Given that standard theoretical and data 
analysis approaches usually rely on the assumption of stationarity, it is important to detect in real 
time series intervals holding that property. With that aim, we introduce a segmentation algorithm 
based on a fully non-parametric approach. We illustrate its applicability through the analysis of 
real time series presenting diverse degrees of non-stationarity, thus showing that this segmentation 
procedure generalizes and allows to uncover features unresolved by previous proposals based on the 
discrepancy of low order statistical moments only. 

PACS numbers: 05.40.2a, 05.45.Tp, 89.75.-k 



I. INTRODUCTION 

Complex systems are seldom in equilibrium or even in 
stationary states; however, their evolution can in many 
cases be thought as being composed of spells of quasi- 
stationarity in which time-varying pseudo-parameters 
can be considered unchanged. Examples of such frame- 
work can be found in finance [l| , biology [Ij , physics 0, 01 
and physiology 0, just to mention a few areas. By 
identifying stationary segments, one can apply stan- 
dard techniques, e.g., extracting stochastic equations 
(Kramers-Moyal coefficients) from data 0, overcoming 
the difficulties of non-stationary treatments 0, As 
another application, a proper segmentation is important 
to assess the scenario of mixed statistics Q based on the 
idea of local equilibrium, typically applied by considering 
a fixed characteristic scale (window length). In general, 
segmentation provides a useful portrait of the local sta- 
tistical properties for modeling non-stationary systems. 

In order to identify such quasi-stationary patches, al- 
gorithms based on standard statistical methodology have 
been introduced. Explicitly, they lean on moving along 
the series a pointer to detect the position that maximizes 
a given quantifier of the statistical discrepancy between 
the segments on both sides of the pointer. Among oth- 
ers 0, worth of mention are the algorithms based on 
the Student's ^-statistic (used to test the sig nificance of 
the null hypothesis of equal means) [l^, El or on the 
Jensen- Shannon divergence in the case of symbolic se- 
quences [l2j . 

Despite the interesting results provided by these meth- 
ods [l^-[T9j. limitations hampering the performance can 
be found in every of them. On the one hand, in the 
statistical moments criteria there is the problem of boil- 
ing down the existence of non-stationarity to the change 
of pre-established local quantities. For instance, even if 
the time series presents fluctuations in the variance, the 
i-test may give us the indication that the series is sta- 
tionary. Although it could be improved through the un- 
equal variance t-test statistic or also through an _F-tcst, 
it will still rely on assumptions over the moments and on 



the validity of the central limit theorem. On the other 
hand, entropy based methods are more fitted to sym- 
bolic sequences, while information is lost if discretizing a 
real valued series by means of thresholds. Moreover, the 
segmentation stopping criteria can be deemed arbitrary. 
Its proposed improvement by means of the Bayesian In- 
formation Criterion can be disputed as well since such 
a criterion often favors minimalist modeling (20j . With 
the aim of surmounting those difficulties, we introduce a 
fully non-parametric segmentation approach by using the 
Kolmogorov-Smirnov (KS) statistic, Dks, which mea- 
sures the maximal distance between the cumulative dis- 
tributions of two samples, as estimate of the discrepancy 
between segments. Note that it allows to test whether 
two samples come from the same distribution with no 
need to specify which is the common distribution. 



II. KS-SEGMENTATION ALGORITHM 

Our algorithm (named KS-segmentation) works as fol- 
lows. Given a segment of a time scries, {xi, ii < i < ««}, 
a sliding pointer, at i = ip, is moved in order to com- 
pare the two fragments 5*^ = {xi-^ , ■ • ■ , } and Sr = 
{ccip+i, . . . , Xi^ }. The position ip of the pointer is moved 



so that the sizes of the two segments (n^ ~ 



-1 and 



n-R = in — ip) are at least unitary. Then, one selects the 
position imax that maximizes the Kolmogorov-Smirnov 
(KS) statistic D = DxsiMnh + 1/nH)"^/^, between the 
two patches Sh and Sr. 

Once found the position imax of the maximal distance 
Z?, j}rn.ax ^ Qj^g chccks the statistical significance (at a 
chosen significance level a = 1 — Pq) of a potentially 
relevant cut at imax by comparison with the result that 
would be obtained was the sequence random [loj . The 
potential cut ticks the first stage if D™'^^ exceeds its crit- 
ical value, D^f^, for the selected significance level (see 
Fig. [7| and further technical details in App. Before 
final acceptance of the cut, one can still require a min- 
imal size (number of points) £o, namely, imax — ii + 1, 
in ^imax ^ ^0- The procedure is then recursively applied 
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starting from the full series {xi, I < i < N}, where N 
is the total number of data points, until no segmentable 
patches arc left. The search for D™"^ within a given 
segment {ii, . . . , i„} during the iterations, as well as in 
the determination of the critical curves, is performed for 
ii < ip < in — 1. The outcome of this segmentation 
procedure when applied to paradigmatic non-stationary 
time series is hereafter presented. 



III. APPLICATIONS 

We first survey the segmentation of heart-rate (RR) 
series, which motivated the introduction of the segmen- 
tation algorithm based on the discrepancy of the means 
(mean-based algorithm) [l^ and that have also been 
suggested as a common focus to solve the controversy 



over the potential chaoticity of normal heart rate [21 



Namely, we revisit the study of interbeat time series from 
healthy individuals (nor) and patients with congestive 
heart failure (chf ) [22[- Time series (tagged nlnn-n5nn 
and clnn-c5nn for (nor) and (chf), respectively) are 
about 24 hours long and had their outliers removed. The 
segmentation outcome is depicted in Fig. [T] 
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FIG. 1: (Color online) Fragment of a heart rate time series 
for a healthy individual (n5nn.txt) (light gray lines). The 
mean +/— the standard deviation of the segments resulting 
from the KS-segmentation, with £o — 50 and Pq — 0.95, are 
displayed. 

We computed the first moments for each resulting seg- 
ment. The variance is deemed not constant throughout 
segments, but it is dispersed over more than one decade 
as illustrated in Fig. [2l for all segment sizes. The lo- 
cal variance is larger for the healthy subjects. Thence, 
equal variance cannot be assumed as in previous anal- 
ysis of heart-rate series [lo| . Furthermore, our finding 
implies that if one keeps such a simpler analysis, at least 
the effective degrees of freedom in the t-test should be ob- 
tained by the Welch-Sattherthwaite equation. It is worth 
referring that despite the tendency for the variance to be 
larger in the nor group, the quotient for consecutive seg- 
ments is very similarly distributed in both groups, with 
a slow power-law decay (with exponent close to -3) (not 
shown) . 
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FIG. 2: Local variance vs segment length for a representa- 
tive individual of each group (left). Probability distribution 
function (PDF) of the variance for all individuals of the nor 
(black lines) and chf (gray lines) groups (right panel). 



The complementary cumulative distributions of seg- 
ment sizes for nor and chf individuals are displayed in 
Fig. [3] For both groups, the plots can be described by a 
double exponential ^e~('"^"'/-^i + (l-A)e"('"^")/-^2 ^[^]^ 

characteristic lengths Li ~ 70 and L2 — 370. Hence 
there is no indication of a scale-free behavior, as sug- 
gested by previous se gme ntation analysis through the 
mean-based algorithm |10 |. 




FIG. 3: Cumulative distribution of segment sizes for nor 
(left panel) and chf (right panel) individuals. In each panel, 
the thin lines correspond to each individual of the group (5 
samples of 24-hour data), the dark dotted line to the en- 
tire group and the dark full line to fits to Ae~'-''~^°^^^^ + 
(1 — ^)e~''~^°''^^^ with amplitude and characteristic lengths 
{A,Li,L2) = (0.78,78,372) and (0.86,64,373) respectively. 

Our next example concerns the scenario of mixed 
statistics that has been applied to the study of fluid tur- 
bulence d, H^. Besides turbulence, its relevance is high- 
lighted by the fact that several models for finance have 
been inspired by this physical problem 2j]. Succinctly, 
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the mixed approach corresponds to a conjecture where 
one has a classical Boltzmann-Gibbs statistics, condi- 
tioned to given temperature (T ^ /3~-^ ^ cr^), which 
signals the existence of local equilibrium, that is associ- 
ated with certain distribution P{l3). Nonetheless, up to 
now, the approaches to the problem have considered the 
existence of a single scale of local equilibrium, which can 
be seen as a first step for an outright description [l^ . 
Endowed with our segmentation algorithm, we are in po- 
sition to evaluate the distribution of local time scales, r, 
of the /3 factor and verify the local equilibrium assump- 
tion in this type of system. As example, we consider a 
series of wind velocities (one month of measurements at 
a 30s acquisition interval) [11]. We reduced the strong 
daily periodicity by scaling the data by the average at 
each time of the day. Then velocity v is dimensionless. 

Accordingly, in a segment within which local equi- 
librium holds, the velocity distribution is defined by 



wherefrom the distribution of 
Along 



p {v\l3) = exp 

the speed, v = \v\^ is = exp 

these lines, we apply our segmentation procedure pitch- 
ing at detecting the time intervals where the local equi- 
librium approximation is valid. The result of the seg- 
mentation is depicted in Fig. 21 The complementary 
cumulative distribution of segments decays more slowly 
than exponentially (plausibly a stretched or double ex- 
ponential, the latter with characteristic times of 32 min 
and 93 min). The distribution of segment lengths has 
mean (standard deviation) approximately equal to 129 
(91) points (corresponding to 64 (45) min approx.). One 
observes in Fig. [5] that the 2D-Maxwell distribution fails 
in describing the distribution of velocities, because the lo- 
cal variance is dispersed (inset of Fig. [5]). Then we con- 
sidered the mixing p(u) = dcr^p (wjcr^) p((T^), where 
p(w|cr^) is the Maxwell distribution defined above, substi- 
tuting /? = (4 — tt) / (2cr^), given that the (conditioned) 



raw moments are (^;")^ = (2//3)"/^ F [1 n/2]. One 
observes that the mixed distribution is in good accord 
with the data distribution, once local trends are removed. 
The mixed distribution also agrees with the histogram 
built from artificial series obtained by juxtaposition of 
sequences of Maxwellian random numbers, with the same 
length and local variance as the real ones. 

Afterwards, we have computed the local variance that 
is proportional to the inverse /3 factor. Its distribution, 
presented in the inset of Fig. [51 is responsible for the de- 
viation of p (w) from the 2D-Maxwcll distribution (main 
frame of Fig. [5]). 
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FIG. 5; Distribution of velocity v (circles). Data were de- 
trended by substracting the excess average with respect to 
the one given by the Maxwell distribution with the same lo- 
cal variance (black circles). Distribution of an artificial series 
with the same segments as the real one, with values of v in- 
dependently drawn from 2D-MaxwelI distributions with the 
local variance of the real series ( gray circles). Thin lines cor- 
respond to the respective 2D-Maxwell distributions, with the 
variance of the whole series, and the full line to the mixing 
(numerically summed up) of the 2D-Maxwell with the distri- 
bution of the local variance obtained from the segmentation 
process, shown in the inset. 
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FIG. 4: (Color online) Time series oi v = \v\ (dimensionless). 
The mean ~f/— the standard deviation of the segments re- 
sulting form the KS-segmentation are displayed (for Iq = 50, 
Po = 0.95). 



Besides, the speed, w, we also looked at angle variations 
(at 30s lag). While a more quantitative analysis on this 
matter is addressed to future work, here we would like to 
call attention to the fact that, even if the average value is 
constantly close to zero, changes in the local variance are 
detected by the present method, as depicted in Fig. [51 
while no segmentation occurs with the procedure based 
on the discrepancy of the means. 



IV. FINAL REMARKS 

In this manuscript we have presented a segmenta- 
tion method which aims at coping with non-stationary 
signals from widespread physical and non-physical sys- 
tems where the local-equilibrium or local-stationarity 
hypotheses hold. Our method, which is based on the 
Kolmogorov-Smirnov test, improves previous proposals 
as soon as it is non-parametric and thus independent of 
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FIG. 6; (Color online) Segmentation of angular deviation at 
30s lag. The mean +/— the standard deviation of the seg- 
ments resulting from the KS-segmentation are displayed. 



with (a,6,c) equal to (1.41,1.74,0.15), (1.52,1.8,0.14) 
and (1.72,1.86,0.13) for Pq = 0.90, 0.95 and 0.99, re- 
spectively. 
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any pre-assumed order of fluctuation between the station- 
ary segments, resulting into a more flexible and effectual 
algorithm. 

Concerning algorithmic complexity, our algorithm is 
more efficient than methods based on matrix diagonal- 
ization, such as principal component analysis. In com- 
parison with moment based segmentation proposals, our 
algorithm requires sorting segments of length n which in- 
creases the complexity in a factor of order Inn. which 
does not represent a significant larger computational 
cost, despite the enhanced ability. 

The applicability of our proposal was then tested with 
two poles apart signals, heart-rate intervals and wind ve- 
locities, with significant results in each case. In the first 
case, the non-stationarity portrait is altered with respect 
to that of previous analysis based on the discrepancy of 
the means, as soon as the local variance cannot be as- 
sumed constant. In the second, the procedure is shown 
to be useful to detect meaningful windows to compute lo- 
cal statistical quantities. In general, proper segmentation 
is helpful in several problems where local-stationarity ap- 
plies. 



Appendix A 

1. Statistical significance criterion 

We determined D'""^' numerically for a large number 
(> 10^) of sequences of i.i.d. Gaussian numbers and 
built its cumulative distribution. From the cumulative 
distribution, we obtained the critical values of D™°'^{N), 
D™1^{N), for each given significance level a = 1 — Pq- 
The resulting critical curves are shown in Figure [Tj For 
the significance tests applied throughout the segmenta- 
tion procedure, we used the effective form of the critical 
curves given by the heuristic simple expression 

i?rrW=«(lniV-6)^ (Al) 



FIG. 7: Critical values of 75™'"^ as a function of the sequence 
length for series of Gaussian i.i.d. random numbers, at 
different significance levels Po indicated on the figure. Full 
lines are fits to the data, used as phenomenological formulae 
for significance checking. 

We noticed that, along a random series, the position 
*max for which D is maximal is not uniform but presents a 
U-shaped distribution. This could set forth a bias in the 
cutting performance propping up an increase in the num- 
ber of short segments. Let us mention that in the mean- 
based algorithm an alike U-shape is also present. To 
avoid this effect, we tested a redefinition of the standard 
KS distance, by considering D = DKs{^/n\ + 1/712)"'^, 
with arbitrary 7 and observed that the flattest (more 
uniform) distribution of imax, for any size N ^ occurs for 
7 ~ 0.64. We compared the implementation of the algo- 
rithm both values of 7 (0.5 and 0.64), however, both for 
real and artificial series no significant differences in the 
segmentation portrait were observed. Then we kept the 
standard definition of D. 



1. Testing artificial series 

To check the performance of the algorithm, we ana- 
lyzed artificial series {yi,l < i < N}, formed by seg- 
ments of n Gaussian numbers. We set unitary jumps in 
the means of consecutive segments (Ay = 1) and alter- 
nating standard deviation (square root of the variance) 
CTi , CT2 , as illustrated in Fig. [51 We varied each standard 
deviation from 1/10 to 10, then embracing a wide range 
of values relative to the jump size. 

Diagrams of the segmentation results in the plane cti , 
(T2 are shown in Figs.l^andfTUl for £0 — 10 and 50, respec- 
tively, at level Pq = 0.95. For each sequence, the percent 
relative number of cuts with respect to the actual one is 
displayed. The outcomes of the KS-algorithm are shown 
in the right-hand side panels and those of the mean-based 
algorithm are also presented (left-hand side panels) for 
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FIG. 8: Artificial series (lightgray lines) formed by segments 
of n = 200 Gaussian numbers with alternating means +1/2, 
— 1/2 and standard deviation cri, cr2 (values indicated on each 
panel). Segmentation was performed by means of the KS- 
algorithm at level Pq ~ 0.95. The vertical dotted lines indi- 
cate the exact (magenta) and calculated (black) borders. 
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FIG. 9: (Color online) Segmentation diagram in the param- 
eter plane cti , (T2 . The percentual relative number of cuts is 
represented in a palette mapping. Each grid cell corresponds 
to a different random sequence of size A'^ = 4000 and segment 
sizes n — 100 (a,d), 200 (b,e) and 400 (c,f). Segmentation 
was performed with the mean-based (a-c) and KS (d-f) algo- 
rithms, with io = 10 and Po = 0.95. 



comparison. Time series belonging to the cyan (light) 



FIG. 10: (Color online) Segmentation diagram in the param- 
eter plane ai,a2, as in Fig. [9] but with with £o ~ 50. 



regions are correctly (100%) segmented, while those be- 
longing blue (dark) regions are typically unsegmentable. 
Red cells indicate oversegmentation. 

The mean-based algorithm performs proper segmenta- 
tion only if the standard deviations are at most of the 
order of the size of the jumps and works well, within the 
chosen confidence level, around de diagonal (cti ~ (T2), as 
expected. Meanwhile, the KS algorithm is able to seg- 
ment series in a larger region of parameter space. Seg- 
mentation fails when the standard deviations of consecu- 
tive segments are not significantly different and are larger 
than the jump size (ai ~ (T2 > Ay). In both procedures, 
for larger segment sizes n, the segmentable domain en- 
larges, but more false cutting points arise. By setting a 
larger value of £0, small segments are discarded and the 
number of false cuts is reduced, as can be seen by com- 
parison of Figs. [9] and [10] which just differ in the value 
of Iq. Of course, false cutting points can also be reduced 
by increasing the value of Pq (compare the second row of 
Fig. [To] with Fig. [Til that only differ in the value of Pq. 

Moreover, one could still improve the algorithm by 
adding a further final step, which is the following one. 
Before accepting a cut, check through the standard KS 
test the significance of the discrepancy between the right- 
hand side portion and its right neighboring segment 
(born in the previous generation) as well as the left-hand 
side portion with its left neighbor, as it has been proposed 
for the mean-based segmentation (T^]. For the analyzed 
scries, this step does not introduce a significant improve- 
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FIG. 11: (Color online) Segmentation diagram in the param- 
eter plane a\,a2, as in Fig. [9jb,e) (i.e., n = 200, I — 10) but 
with with Po = 0.99. 
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FIG. 12: (Color online) Segmentation diagram in the param- 
eter plane ai,a2, as in Fig. [9jb,e) (i.e., n = 200, £ = 10, 
Po = 0.95), with the additional neighbor segment check. 



mcnt (sec Fig. [12] to be compared with the second row 
of Fig.[9|), however, if, depending on the analyzed series, 
one observes oversegmentation, then that step could be 
straightforwardly added. 

Let us comment that as one approaches the frontier 
of the segmentable region, although segments are recog- 
nized, the position of the boundaries of the segments gets 
more imprecise. Effect which is reduced by increasing the 
confidence level. 

The above diagrams depict the scope of the algorithm 
for a particular class of artificial series but provides a feel- 
ing of its range of applicability and limitations. It also 
manifests the importance of our method in enlarging the 
domain of segmentation. There is an infinity of other 
tests, e.g., with diverse variabilities of means and vari- 
ances, correlations, other statistics discrepancies, that 
could be performed. Also tests restricted to the compar- 
ison of the outcoming statistics could be carried out . 
However, when applying this or other algorithm, it may 
be convenient to perform ad-hoc test, depending on the 
particular statistical characteristics of the analyzed se- 
ries. 



3. Robustness 

For the analyzed series, we checked the robustness of 
the results with respect to the significance level {Pq = 



0.90,0.95,0.99). The smaller Pq, the larger the tendency 
to allow small segments, while larger ones are almost un- 
affected. However, this effect does not change signifi- 
cantly the statistics of segment sizes, as illustrated for 
heart-rate series in Fig. [13] (right panel). The impact of 
£q was also checked (left panel of Fig. [T5)) . Slopes do 
not significantly change, except for small values oi I, as 
expected, since smaller fragments are allowed with de- 
creasing £(). Notwithstanding, the probability density of 
larger segments is not significantly altered. Let us no- 
tice that the statistics on segments may be affected by 
the increase of small segments if the studied quantity is 
correlated with the size. However, this does not seem to 
happen in the analyzed cases. 




FIG. 13: Left panel: Complementary cumulative distribution 
of segment sizes obtained for different values of the minimal 
length £o indicated on the figure. Data correspond to a nor- 
mal individual (n5nn.txt). Right panel: Complementary cu- 
mulative distribution of segment sizes at different significance 
levels indicated on the figure. 

We also checked that the same segmentation patches 
are typically recovered even when analyzing small frag- 
ments of the whole series. In fact, significant statistical 
jumps even for small segments are recognized, prompting 
a segmentation point. 
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